Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase the input block size for bgzip. #1768

Merged
merged 2 commits into from
Nov 18, 2024

Conversation

jkbonfield
Copy link
Contributor

Commit e495718 changed bgzip from unix raw POSIX read() calls to hread(). Unfortunately hread gets its buffer size from stat of the input file descriptor, which can be 4kb for a pipe. We're reading 0xff00 bytes, so this ends up being split over two reads mostly, with one or both involving additional memcpys. This makes the buffered I/O worse performing than non-buffered. In the most extreme cases (cat data | bgzip -l0 > /dev/null) this is a two fold slow down.

The easy solution is just to increase the buffer size to something sensible. It's a little messy as we have to use hfile_internal.h to get hfile_set_blksize, but it works. I'm not sure why we didn't elect to make that API more public. Probably simply out of caution.

Fixes #1767

@jkbonfield
Copy link
Contributor Author

Edited the buffer down to 256Kb instead of 1Mb as it seems to be sufficient (tested on tmpfs, fast NFS and lustre).

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Apr 10, 2024

Hmm, it's still variable! The effect of a bigger block size is more memcpy as we have fewer direct reads (as @daviesrob points out a readv can partially solve that as we can read direct to caller buff + remainder to look-ahead cache, but it doesn't fit with the backend semantics), but it also reduces the impact of many small reads due to small pipe size.

Hence some machines it's slower to have a bigger block size, and it can have weird interactions with CPU load too which I cannot explain. Eg it's a win at -l0, but a penalty at -l1. It needs more head scratching probably.

@daviesrob daviesrob self-assigned this May 9, 2024
@jkbonfield
Copy link
Contributor Author

jkbonfield commented Oct 29, 2024

I retested this with different block sizes on a few systems. This time it was aggregate wall clock time on 100 trials of the same file, so cached and a lot of reproducability. Bgzip -l5, but our systems all have CPU frequency scaling on and the filesystems are sometimes shared so there are sources of error and randomness too. Even so, the charts are interesting.

image

image

So 128k is enough on the Intel and 256k is enough on the AMD. I also tried an older Intel machine, but it just flatlined the CPU and made very little difference in buffer sizes. Trying -l1 instead changed it a little, but not significantly.

I also tested reading an actual file rather than a pipe from Lustre on the AMD system and basically it was identical performance regardless of buffer size. This backs up my findings in the linked issue. Looking at CPU instead of elapsed time shows a small drop at around 128KB onwards, but not huge and not worth taking into account I suspect.

However I think the 256k here is fine, as would 128kb so we should consider this PR again. We could also consider doing an fstat and only applying it on pipes.

@daviesrob
Copy link
Member

I think 128kB blocks are better, but would it be a good idea to put this in hFILE so all pipe users could benefit? I think it should probably only be for FIFOs though to avoid potential issues with over-reading on index-based jobs. The current limit of 32k in hfile_init would need to be increased, but I doubt making it 128k instead would be a major issue.

Commit e495718 changed bgzip from unix raw POSIX read() calls to
hread().  Unfortunately hread gets its buffer size from stat of the
input file descriptor, which can be 4kb for a pipe.  We're reading
0xff00 bytes, so this ends up being split over two reads mostly, with
one or both involving additional memcpys.  This makes the buffered I/O
worse performing than non-buffered.  In the most extreme cases (cat
data | bgzip -l0 > /dev/null) this is a two fold slow down.

The easy solution is just to increase the buffer size to something
sensible.  Currently we play it cautiously and only do this on pipes
and fifos.

Fixes samtools#1767
@jkbonfield
Copy link
Contributor Author

It's not possible to distinguish between normal pipes and named pipes (FIFOs), but I don't think that matters much and it still works OK on named pipes.

The real problem is the buffer size is pathetic at 4096 according to stat. This is actually smaller than the standard linux pipe size. It's still internally cachine at 64kb, but we're reading it in small chunks. I suspece that's because it's the size the OS supports for atomic writes, but we're optimise for speed here.

I've moved the change to hfile instead.

hfile.c Show resolved Hide resolved
On AMD EPYC 7713 aligning to cache size boundaries makes a very
significant difference to fp->backend->read performance in the
kernel.  A modern Intel CPU did not demonstrate this difference.

x86 often have cache line size of 64 bytes, and apple Arm chips 128
bytes.  I haven't tested if Arm benefits from alignment during read
calls, but we can check size with sysconf(_SC_LEVEL1_DCACHE_LINESIZE).
However to avoid additional autoconfery I just picked 256 as it gives
us headroom and is simple.

Speed ups on the AMD EPYC:

time bash -c 'for i in `seq 1 30`;do cat < ~/lustre/enwik9| ./bgzip -l5 -@32 > /dev/null;done'

Unaligned
real    0m45.012s
user    10m7.661s
sys     0m58.770s

Aligned
real    0m30.717s
user    11m14.004s
sys     0m32.921s

It is likely this could improve other bits of code too.
@daviesrob daviesrob merged commit 0fdba48 into samtools:develop Nov 18, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bgzip performance drop from v1.16
2 participants